Skip to content

Minimal query-aware statistics request hooks (extracted from #21996)#22300

Draft
adriangb wants to merge 5 commits into
apache:mainfrom
pydantic:statistics-request-hooks
Draft

Minimal query-aware statistics request hooks (extracted from #21996)#22300
adriangb wants to merge 5 commits into
apache:mainfrom
pydantic:statistics-request-hooks

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented May 17, 2026

Problem statement

ListingTable currently loads all statistics for Parquet files when an external table is created. This makes sense in the context of how ClickBench counts table loading (statistics loading at table initialization is not counted towards query time) but it has drastic problems for other systems:

  • The TableProvider has no idea what statistics are useful or not. All it knows is what columns are being projected and filtered on. But e.g. for select * from t you want to get no stats but for select * from t order by col you'd want to get just stats for col.
  • Real world systems can have multi TB tables with tens of thousands of files and serve multi-tenant queries so they have thousands of tables. Collecting stats upfront is not possible. It also results in memory bloat: holding stats for all columns on a wide table for 100k files would itself be larger than memory. Even with a limited size cache (see e.g. Add limit to DefaultFileStatisticsCache #19052) you'd end up just churning the cache.
  • For systems with an external metadata store / catalog that can provide more granular statistics requesting unnecessary stats is wasted work. For Parquet it is more or less a sunk cost (you have to do the IO for the entire footer) but if your system is able to do more granular IO it becomes wasted work.

Proposed solution

Long term I'd like to move towards #21996.
This is the first minimal step to enable it: allow statistics requests to be bound to a scan so that a TableProvider can know what statistics to gather.

It still needs to populate the dense / wasteful statistics data structures, ListingTable doesn't use this and there is no optimizer rule to decide what stats should be collected. Users have to write this all themselves, but at least now it will be possible. Currently it is not possible because there is no way to communicate from a logical optimizer rule into a TableScan::scan_with_args call.

Alternatives considered

Make this an opaque extension and put extensions: Extensions onto ScanArgs.
I think the API being proposed here is reasonable and has the opportunity to grow into something that DataFusion actually uses. Erasing the types / entire API obfuscates the functionality, I'd rather not do that.

Next steps

Two straightforward followups would be to:

  1. Add a logical optimizer rule to populate statistics requests by introspecting plans. I think there are a lot of obvious cases that could be covered, and users can always add their own.
  2. Have ListingTable use it. I spoke with @alamb about this and it's not clear that this will be a win: DataFusion implicitly creates a table for a query like select col from 't.parquet'; and at table creation time we do not know the query. I'd argue that if we are collecting stats for a single query we should tailor them to that query, but I also concede that there isn't that much benefit in terms of runtime given the sunk cost for Parquet (there is still a memory usage benefit). Since we don't use ListingTable I am unlikely to push this.

Which issue does this PR close?

What changes are included in this PR?

Five small, independently-reviewable commits:

  1. refactor: add TableScanBuilder, deprecate TableScan::try_newTableScan::try_new takes five positional args and bare TableScan { .. } literals are fragile to field additions. Introduce TableScanBuilder (with From<TableScan>), move schema derivation into build(), deprecate try_new (delegates to the builder), migrate all in-tree callers. Pure refactor.
  2. feat: add StatisticsRequest — new public vocabulary types in datafusion-expr-common::statistics. Nothing consumes them yet. I left out any sort of response / result types.
  3. feat: add TableScan::statistics_requests field — an advisory Vec<StatisticsRequest> on TableScan, settable via TableScan::with_statistics_requests / TableScanBuilder. Empty by default; DataFusion's own rules never populate it.
  4. feat: thread statistics requests into ScanArgsScanArgs gains statistics_requests; the physical planner threads TableScan::statistics_requests into it so the request reaches TableProvider::scan_with_args.
  5. test: e2e statistics-request flow via a custom optimizer rule — an integration test playing both external roles.

Deliberately left out vs #21996: the built-in RequestStatistics optimizer rule, the FilePruner / ListingTable consumer integration, the PartitionedFile::satisfied_stats per-file response field, and StatisticsValue::Distribution (which would depend on the now-deprecated Distribution type).

Are these changes tested?

Yes:

  • datafusion-expr-common: a unit test that StatisticsRequest is hashable / usable as a HashMap key.
  • datafusion/core/tests/user_defined/statistics_requests.rs: an end-to-end integration test where a custom OptimizerRule annotates TableScan and a custom TableProvider asserts the requests reach scan_with_args — plus a test that without such a rule the provider sees an empty request list.
  • All existing datafusion-expr / datafusion-optimizer / datafusion-proto tests pass against the TableScanBuilder refactor.

Are there any user-facing changes?

Yes — this needs the api change label:

  • New public types StatisticsRequest, StatisticsValue, SatisfiedStatistics (re-exported via datafusion_expr::statistics).
  • New TableScanBuilder; TableScan::try_new is deprecated (still works, delegates to the builder).
  • TableScan gains a new public field statistics_requests — this breaks exhaustive TableScan { .. } struct literals downstream (the recommended fix is TableScanBuilder).
  • ScanArgs gains with_statistics_requests / statistics_requests.

🤖 Generated with Claude Code

`TableScan::try_new` takes five positional arguments and bare
`TableScan { .. }` struct literals are scattered across the codebase,
making both fragile to field additions.

Introduce `TableScanBuilder` (with `From<TableScan>`, so an existing
scan can be decomposed, tweaked, and rebuilt) and move the
schema-derivation logic into `build()`. `TableScan::try_new` is now
deprecated and delegates to the builder; all in-tree callers are
migrated to the builder. Pure refactor, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate catalog Related to the catalog crate proto Related to proto crate labels May 17, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 17, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  99.723s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.038s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  96.804s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.037s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.972s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 199.647s] datafusion
    Building datafusion-catalog v53.1.0 (current)
       Built [  36.372s] (current)
     Parsing datafusion-catalog v53.1.0 (current)
      Parsed [   0.029s] (current)
    Building datafusion-catalog v53.1.0 (baseline)
       Built [  36.157s] (baseline)
     Parsing datafusion-catalog v53.1.0 (baseline)
      Parsed [   0.028s] (baseline)
    Checking datafusion-catalog v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.185s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  74.286s] datafusion-catalog
    Building datafusion-expr v53.1.0 (current)
       Built [  25.694s] (current)
     Parsing datafusion-expr v53.1.0 (current)
      Parsed [   0.078s] (current)
    Building datafusion-expr v53.1.0 (baseline)
       Built [  25.474s] (baseline)
     Parsing datafusion-expr v53.1.0 (baseline)
      Parsed [   0.079s] (baseline)
    Checking datafusion-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.723s] 222 checks: 220 pass, 2 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field TableScan.statistics_requests in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2790
  field TableScan.statistics_requests in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2790

--- failure type_method_marked_deprecated: type method #[deprecated] added ---

Description:
A type method is now #[deprecated]. Downstream crates will get a compiler warning when using this method.
        ref: https://doc.rust-lang.org/reference/attributes/diagnostics.html#the-deprecated-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/type_method_marked_deprecated.ron

Failed in:
  method datafusion_expr::logical_plan::TableScan::try_new in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2866
  method datafusion_expr::TableScan::try_new in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2866

     Summary semver requires new major version: 1 major and 1 minor checks failed
    Finished [  54.417s] datafusion-expr
    Building datafusion-expr-common v53.1.0 (current)
       Built [  18.440s] (current)
     Parsing datafusion-expr-common v53.1.0 (current)
      Parsed [   0.019s] (current)
    Building datafusion-expr-common v53.1.0 (baseline)
       Built [  18.046s] (baseline)
     Parsing datafusion-expr-common v53.1.0 (baseline)
      Parsed [   0.019s] (baseline)
    Checking datafusion-expr-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.259s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  37.884s] datafusion-expr-common
    Building datafusion-optimizer v53.1.0 (current)
       Built [  25.660s] (current)
     Parsing datafusion-optimizer v53.1.0 (current)
      Parsed [   0.030s] (current)
    Building datafusion-optimizer v53.1.0 (baseline)
       Built [  25.901s] (baseline)
     Parsing datafusion-optimizer v53.1.0 (baseline)
      Parsed [   0.032s] (baseline)
    Checking datafusion-optimizer v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.228s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  53.307s] datafusion-optimizer
    Building datafusion-proto v53.1.0 (current)
       Built [  54.052s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.143s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  53.990s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.147s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   2.429s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 113.019s] datafusion-proto

@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 17, 2026
@adriangb adriangb added the api change Changes the API exposed to users of the crate label May 17, 2026
@adriangb adriangb force-pushed the statistics-request-hooks branch from 0ef815a to e2838e8 Compare May 17, 2026 15:40
@adriangb adriangb requested a review from alamb May 17, 2026 15:48
@adriangb
Copy link
Copy Markdown
Contributor Author

cc @asolimando

adriangb and others added 4 commits May 17, 2026 08:52
Add `StatisticsRequest` to `datafusion-expr-common::statistics` — a small
vocabulary for query-aware statistics: a caller can ask a provider for a
specific statistic (Min/Max/NullCount/DistinctCount/Sum/ByteSize per
column, plus RowCount and TotalByteSize) instead of for a dense
`Statistics` covering every column.

It is intentionally just a vocabulary; nothing in DataFusion populates or
consumes it yet. Re-exported via `datafusion_expr::statistics`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an advisory `statistics_requests: Vec<StatisticsRequest>` field to
`TableScan`. A custom optimizer rule can attach the statistics the
surrounding plan shape would benefit from (e.g. Min/Max for sort keys)
via `TableScan::with_statistics_requests` or the new
`TableScanBuilder::with_statistics_requests`; the physical planner will
thread them into the table provider (next commit).

The field is empty by default and DataFusion's own rules never populate
it. `Debug`/`PartialEq`/`Eq`/`Hash`/`PartialOrd` for `TableScan` are
left unchanged — it is advisory metadata, not part of plan identity.

`map_expressions` in `tree_node.rs` is rewritten to rebuild `TableScan`
via `..scan` instead of an exhaustive destructure, so it carries this
(and any future) field through untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a `statistics_requests` field to `ScanArgs` (with
`with_statistics_requests` / `statistics_requests` accessors) and have
the physical planner thread `TableScan::statistics_requests` into it.

This completes the request-side path: a custom optimizer rule annotates
`TableScan`, and the request reaches a custom `TableProvider` in
`scan_with_args`. DataFusion's own providers ignore the field; the
default `ScanArgs` value is an empty slice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a self-contained integration test that plays both external roles:
a custom `OptimizerRule` annotates each `TableScan` with
`StatisticsRequest`s, and a custom `TableProvider` records the
`ScanArgs::statistics_requests` it receives in `scan_with_args`.

This demonstrates the request-side hooks are sufficient to build the
feature entirely outside of DataFusion. A second test confirms that
without such a rule the provider sees an empty request list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the statistics-request-hooks branch from e2838e8 to 070700d Compare May 17, 2026 15:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate auto detected api change Auto detected API change catalog Related to the catalog crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant